Lab 2: Exploring Image Data

1. Business Understanding

Give an overview of the dataset. Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?). What is the prediction task for your dataset and which third parties would be interested in the results? Why is this data important? Once you begin modeling, how well would your prediction algorithm need to perform to be considered useful to the identified third parties? Be specific and use your own words to describe the aspects of the data.

I obtained this dataset from Kaggle. It contains 3,710 with-mask images and 3,828 without-mask images, split into two directories. To meet the requirements of this assignment, I perform some manipulations before processing the image data. The task I chose this dataset for is to determine, from a captured image, whether a person is wearing a face mask.

Due to the pandemic, it is essential for everybody to wear a face mask in public places, so a face-mask detection system is meaningful. For example, it could be used for entry validation when someone tries to enter a public place, or as a friendly reminder. In addition, a machine could do this job instead of a person: I have seen many people stationed at or near the entrances of shopping malls and hotels just to ensure that anyone entering is wearing a face mask. For business use, the accuracy of this mask detection should be at least 90%. However, since this dataset contains few noise factors and is therefore somewhat cleaner than real-world conditions, an accuracy of up to 97% would be more meaningful.

2. Data Preparation

2.1 Import the required libraries
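The exact imports are not shown here; a minimal sketch of the libraries the later sections rely on (NumPy, Matplotlib, and scikit-learn's PCA) might look like:

```python
import os

import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripted runs
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
```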

2.2 Import the image files as a dataset

Since the images are already classified but come in different sizes, I had to resize them to 100×100 and move some of them into a separate directory to serve as the training set.
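A minimal sketch of the resizing step, assuming Pillow is used; the directory names in the usage comment are hypothetical:

```python
import os
from PIL import Image

def resize_images(src_dir, dst_dir, size=(100, 100)):
    """Resize every image in src_dir to `size` and save it to dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        try:
            with Image.open(os.path.join(src_dir, name)) as im:
                im.convert("RGB").resize(size).save(os.path.join(dst_dir, name))
        except OSError:
            continue  # skip files that are not readable images

# Hypothetical usage:
# resize_images("with_mask", "train/with_mask")
```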

2.3 Display several resized images

2.4 Read the image files into a NumPy array

Linearize the images to create a table of 1-D image features (each row should be one image)
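A sketch of this linearization, assuming Pillow and grayscale conversion; the directory layout is the one described in section 2.2:

```python
import os
import numpy as np
from PIL import Image

def load_flattened(dir_path, size=(100, 100)):
    """Load each image as grayscale, resize, and flatten it into one row."""
    rows = []
    for name in sorted(os.listdir(dir_path)):
        try:
            with Image.open(os.path.join(dir_path, name)) as im:
                arr = np.asarray(im.convert("L").resize(size), dtype=np.float64)
        except OSError:
            continue  # skip non-image files
        rows.append(arr.ravel())  # 100*100 -> 10000 features per row
    return np.vstack(rows)       # shape: (n_images, 10000)
```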

2.5 Display several greyscale images from np_data

3. Data Reduction

3.1 Perform linear dimensionality reduction of the images using PCA

300 is a good value for the number of PCA components, since it explains roughly 97% of the variance in the data.
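This step can be sketched with scikit-learn; `np_data` is assumed to be the flattened image matrix from section 2.4:

```python
from sklearn.decomposition import PCA

def fit_pca(np_data, n_components=300):
    """Fit PCA and report the cumulative explained variance ratio."""
    pca = PCA(n_components=n_components)
    reduced = pca.fit_transform(np_data)
    explained = pca.explained_variance_ratio_.sum()
    return reduced, explained
```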

3.2 Perform linear dimensionality reduction using randomized PCA

As with standard PCA, 300 components explain about 97% of the variance in the dataset.
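In current scikit-learn, randomized PCA is obtained by passing `svd_solver="randomized"` to `PCA` (the older standalone `RandomizedPCA` class was removed); a sketch:

```python
from sklearn.decomposition import PCA

def fit_randomized_pca(np_data, n_components=300):
    """Randomized SVD solver: faster on large matrices, approximate result."""
    rpca = PCA(n_components=n_components, svd_solver="randomized",
               random_state=0)
    reduced = rpca.fit_transform(np_data)
    return reduced, rpca.explained_variance_ratio_.sum()
```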

3.3 Compare the representation using PCA and Randomized PCA

I have run the comparison dozens of times and reached a rough conclusion: randomized PCA is consistently faster, while its accuracy is sometimes better and sometimes worse than standard PCA. If I had to choose one for the mask-detection task, I would choose randomized PCA for its lower time consumption, since we cannot make people wait too long for mask detection, and human checks could serve as an auxiliary measure. However, a customer might not pay for either approach, because the accuracy is below 90%, so neither may be suitable on its own.
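The timing comparison can be sketched as below; the random matrix stands in for the real image data, so the absolute times are illustrative only:

```python
import time
import numpy as np
from sklearn.decomposition import PCA

def time_solver(X, n_components, solver):
    """Return the wall-clock seconds needed to fit PCA with a given solver."""
    start = time.perf_counter()
    PCA(n_components=n_components, svd_solver=solver).fit(X)
    return time.perf_counter() - start

X = np.random.default_rng(0).normal(size=(500, 1000))
t_full = time_solver(X, 50, "full")         # exact SVD
t_rand = time_solver(X, 50, "randomized")   # approximate, usually faster
```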

3.4 Perform feature extraction

I use DAISY for feature extraction in this part, following the example in "04 Dimension Reduction and Image.ipynb".
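A sketch of DAISY extraction with scikit-image; the step, radius, ring, and histogram parameters here are illustrative and not necessarily the notebook's values:

```python
import numpy as np
from skimage.feature import daisy

def daisy_features(gray_image):
    """Extract DAISY descriptors over a grid and flatten them into one vector."""
    descs = daisy(gray_image, step=20, radius=20, rings=2,
                  histograms=6, orientations=8)
    return descs.ravel()  # one fixed-length feature vector per image
```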

Since the accuracy with DAISY features is greater than 90%, my customer might pay for it, but there is still room for optimization because of its time consumption.

I also tried Gabor feature extraction; its accuracy was lower than PCA's, so I do not recommend it.

4. Exceptional Work (1 point total)

One idea (required for 7000-level students): Perform feature extraction upon the images using DAISY. Rather than matching the images with the total DAISY vector, you will instead use key-point matching. You will need to investigate appropriate methods for key-point matching using DAISY. NOTE: this often requires some type of brute-force matching per pair of images, which can be computationally expensive.

I compare each image in the test set with every image in the training set via key-point matching and take the class of the best-matching training image as the prediction.
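This brute-force scheme can be sketched with scikit-image's `match_descriptors` on DAISY descriptors; the DAISY parameters and the label handling are illustrative:

```python
import numpy as np
from skimage.feature import daisy, match_descriptors

def descriptors(gray_image):
    """DAISY descriptors reshaped to (n_keypoints, descriptor_length)."""
    d = daisy(gray_image, step=20, radius=20, rings=2,
              histograms=6, orientations=8)
    return d.reshape(-1, d.shape[-1])

def match_count(img_a, img_b):
    """Number of mutually nearest descriptor pairs between two images."""
    matches = match_descriptors(descriptors(img_a), descriptors(img_b),
                                cross_check=True)
    return len(matches)

def predict(test_img, train_imgs, train_labels):
    """Brute force: match against every training image, take the best count."""
    counts = [match_count(test_img, t) for t in train_imgs]
    return train_labels[int(np.argmax(counts))]
```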

The accuracy of the key-point matching algorithm is close to that of K-Nearest Neighbors, so there may be room for optimization. However, I do not know how to evaluate the feature-extraction and matching stages separately, so I have no idea of the upper limit of either stage. Besides, the running time is not acceptable for business use. Perhaps a later course will introduce more interesting algorithms that improve the accuracy.